[4/7] Telemetry Event Emission and Aggregation by samikshya-db · Pull Request #327 · databricks/databricks-sql-nodejs

samikshya-db · 2026-01-28T13:11:42Z

Part 4 of 7-part Telemetry Implementation Stack

This layer adds event-driven telemetry emission, per-host shared
aggregation/export, and operation-level error telemetry on top of the
hardened infrastructure shipped in [1/7]–[3/7].

What's in this PR

TelemetryEventEmitter (`lib/telemetry/TelemetryEventEmitter.ts`)

Per-DBSQLClient event emitter — typed emission methods, respects the
client's telemetryEnabled flag, swallows all exceptions at debug level.
Each emitter bridges into the shared aggregator on the per-host
TelemetryClient.

Event types: CONNECTION_OPEN, CONNECTION_CLOSE (new), STATEMENT_START,
STATEMENT_COMPLETE, CLOUDFETCH_CHUNK, ERROR.

TelemetryClient owns the per-host triad (`lib/telemetry/TelemetryClient.ts`)

TelemetryClientProvider is a process-wide singleton. Each host gets one
TelemetryClient that owns:

DatabricksTelemetryExporter
MetricsAggregator
CircuitBreakerRegistry
FeatureFlagCache

Multiple DBSQLClient instances on the same host share these — breaker
counters and HTTP batches don't fragment per-instance. TelemetryClient
implements IClientContext so the owned components have a stable context
that survives any single DBSQLClient's close. Connection/auth providers
are tracked in a FIFO of registered contexts; the exporter falls through
to the next active one when the head closes.

MetricsAggregator (per-host, on `TelemetryClient`)

Restored from main's hardened version, with new functionality layered on:

CONNECTION_CLOSE handling — emits DELETE_SESSION connection metric.
Chunk-timing aggregation — chunkInitialLatencyMs,
chunkSlowestLatencyMs, chunkSumLatencyMs accumulated from
CLOUDFETCH_CHUNK events with positive latency.
Memory bounds: maxPendingMetrics (drop preferring non-error to keep
first-failure signal), maxErrorsPerStatement, statementTtlMs eviction.
Flush triggers: batch size, periodic timer (unref()'d), terminal-error
immediate flush, manual flush().
close() is async and awaits the final HTTP POST so
await client.close(); process.exit(0) doesn't truncate the last batch.

Error telemetry wired into operation entry points

DBSQLOperation now emits ERROR events (with ExceptionClassifier
terminal/retryable classification) from fetchChunk, cancel, close,
and getMetadata. Failed queries produce a STATEMENT_COMPLETE plus an
ERROR proto with error_info: { error_name, stack_trace } (stack run
through redactSensitive).

emitStatementComplete no longer issues a getMetadata Thrift RPC on
close (perf regression + spurious-error-telemetry trap).

Type-safe wiring (`IClientContext`)

Added optional getTelemetryEmitter() / getTelemetryAggregator() to
IClientContext. Removed all (this.context as any) casts at the seven
emit call sites (DBSQLOperation, DBSQLSession, RowSetProvider,
CloudFetchResultHandler).

The six copy-pasted listeners in DBSQLClient.initializeTelemetry are now
one bridge loop over Object.values(TelemetryEventType) — closes the
listener-name mismatch that originally caused error events to be silently
dropped.

mapAuthType covers all six authType values (access-token, databricks-oauth
{U2M/M2M}, custom, token-provider, external-token, static-token).

Verified end-to-end against an Azure Databricks workspace

Healthy SELECT 1 produces 3 wire metrics: CREATE_SESSION (with
system_configuration), STATEMENT_COMPLETE (with
sql_operation.execution_result), DELETE_SESSION.

Failed query produces 4: CREATE_SESSION, STATEMENT_COMPLETE (latency
only), ERROR (with redacted stack), DELETE_SESSION.

Server-side feature flag is still the kill switch — telemetryEnabled: false
on the client also skips the entire pipeline (no acquire/release noise).

Testing

484 unit tests pass (telemetry + DBSQLClient/Operation/Session/result).
Test files for the rebased modules are restored from main. Provider tests
updated for the singleton API.

Coverage gap acknowledged: no unit tests yet for the re-applied
chunk-timing aggregation or CONNECTION_CLOSE handling. Adding these as a
follow-up to the same epic.

Dependencies

Depends on:

[1/7] Telemetry Foundation ([1/7] Telemetry Foundation: Types, Config, and Exception Classifier #324)
[2/7] Infrastructure ([2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache #325)
[3/7] Client Management ([3/7] Telemetry Client Management: TelemetryClient and Provider #326)

Out of scope (open follow-ups)

Add operation_type proto field so CREATE_SESSION and DELETE_SESSION
are explicitly distinguished on the wire (today they're distinguishable
only by the incidental presence of system_configuration).
Add getStats() for telemetry-pipeline self-observability (drop counts,
queue depth, last-success timestamp).
Expose remaining telemetry config knobs in ConnectionOptions (currently
3 of 13).
Reunify connection.open/connection.close emission (open from DBSQLClient,
close from DBSQLSession).
Unit tests for chunk-timing aggregation and CONNECTION_CLOSE.

samikshya-db · 2026-01-29T08:22:28Z

The emission format confirms to the telemetry proto, marked this ready for review.

This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier

Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier

This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure

Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier

This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure

This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier

This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure

This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

This is part 5 of 7 in the telemetry implementation stack. Components: - DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker - TelemetryExporterStub: Test stub for integration tests DatabricksTelemetryExporter: - Exports telemetry metrics to Databricks via HTTP POST - Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth) - Integrates with CircuitBreaker for per-host endpoint protection - Retry logic with exponential backoff and jitter - Exception classification (terminal vs retryable) Export Flow: 1. Check circuit breaker state (skip if OPEN) 2. Execute with circuit breaker protection 3. Retry on retryable errors with backoff 4. Circuit breaker tracks success/failure 5. All exceptions swallowed and logged at debug level Retry Strategy: - Max retries: 3 (default, configurable) - Exponential backoff: 100ms * 2^attempt - Jitter: Random 0-100ms to prevent thundering herd - Terminal errors: No retry (401, 403, 404, 400) - Retryable errors: Retry with backoff (429, 500, 502, 503, 504) Circuit Breaker Integration: - Success → Record success with circuit breaker - Failure → Record failure with circuit breaker - Circuit OPEN → Skip export, log at debug - Automatic recovery via HALF_OPEN state Critical Requirements: - All exceptions swallowed (NEVER throws) - All logging at LogLevel.debug ONLY - No console logging - Driver continues when telemetry fails Testing: - 24 comprehensive unit tests - 96% statement coverage, 84% branch coverage - Tests verify exception swallowing - Tests verify retry logic - Tests verify circuit breaker integration - TelemetryExporterStub for integration tests Dependencies: - Builds on all previous layers [1/7] through [4/7]

github-actions · 2026-04-29T18:57:01Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…h, public types, README, test gaps - DBSQLClient.initializeTelemetry: release the per-host refcount on the catch path so a throw between getOrCreateClient and the success log doesn't leak a TelemetryClient (with its flush timer / exporter / FFCache) for the lifetime of the process on long-running supervisors. - TelemetryClient.getClient/getDriver: walk the FIFO with try/catch fallthrough to mirror getConnectionProvider, so a closed-but-not-yet- released head context doesn't take the whole shared pool down. getAuthProvider: return the first defined entry from the FIFO. - lib/index.ts: re-export TelemetryEventType, DEFAULT_TELEMETRY_CONFIG, and the consumer-facing telemetry payload types so SDK users don't need to deep-import for type-checked event/metric handling. - README: add a Telemetry section covering what's collected, the three opt-out paths (env var, programmatic, server-side feature flag), the tunable knobs, and the await-close requirement for short-lived processes. - MetricsAggregator.test.ts: cover chunk-timing aggregation (initial=first-positive, slowest=max, sum=running) and CONNECTION_CLOSE → DELETE_SESSION emission. Both were acknowledged coverage gaps. - TelemetryEventEmitter.test.ts: cover emitConnectionClose — emission shape, disabled-flag suppression, and listener-exception swallow. Co-authored-by: Isaac

github-actions · 2026-04-29T19:55:39Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

vikrantpuppala · 2026-05-04T09:51:14Z

+      safeEmit(this.context, (emitter) => {
+        emitter.emitConnectionClose({
+          sessionId: this.id,
+          latencyMs: Date.now() - this.openTime,


[F1 — Critical] connection.close.latencyMs reports session lifetime, not RPC duration

this.openTime is set in the session constructor (DBSQLSession.ts:175, this.openTime = Date.now()), so this latencyMs is the wall-clock session lifetime, not the closeSession RPC duration. The matching CREATE_SESSION event emits actual openSession RPC duration. Two different definitions land under the same operation_latency_ms field server-side, leading to incorrect close-latency conclusions.

Fix: capture closeStart = Date.now() immediately before await driver.closeSession(...) and emit Date.now() - closeStart. Or rename the field on close (e.g. sessionLifetimeMs) and document both meanings.

Posted by Code Review Squad.

Done in 5669c30 — capture closeStart = Date.now() immediately before await driver.closeSession(...) and emit Date.now() - closeStart in the emitClose() closure. CONNECTION_CLOSE.latencyMs now reports the closeSession RPC duration symmetric with CREATE_SESSION. See lib/DBSQLSession.ts in the new commit.

vikrantpuppala · 2026-05-04T09:51:16Z

+    // Extract workspace ID from hostname (first segment before first dot)
+    const parts = host.split('.');
+    return parts.length > 0 ? parts[0] : host;
+  }


[F2 — Critical] extractWorkspaceId is wrong on AWS and Azure hostnames

host.split('.')[0] returns:

AWS dbc-XXXXX-YYYY.cloud.databricks.com → dbc-XXXXX-YYYY (not the workspace ID; that's the deployment shard prefix)

Azure adb-NNNNNNNNNNNNN.NN.azuredatabricks.net → adb-NNNNNNNNNNNNN (workspace ID is the numeric portion of adb-N, dropping the adb- prefix and trailing form-factor digit)

Every CREATE_SESSION/DELETE_SESSION metric ships a confidently-wrong workspaceId for AWS and Azure customers, silently mis-attributing usage server-side. The JSDoc example workspace-id.cloud.databricks.com is not a real hostname format.

Fix: pull the real workspace ID from the auth response or the connection URL ?o=N parameter. If neither is available, omit the field rather than ship a wrong value. Also: split('.') never "fails" — the or host if extraction fails comment is misleading.

Posted by Code Review Squad.

Done in 5669c30 — extractWorkspaceId() now reads the numeric ?o=N query param from httpPath and returns undefined when no workspace ID can be derived. Host-based extraction is gone (shipped wrong values on AWS/Azure). The emitter signature was loosened to workspaceId?: string so an unknown value is omitted rather than guessed. New unit tests cover AWS/Azure/missing/non-numeric paths.

vikrantpuppala · 2026-05-04T09:51:17Z

  }
+
+  /** @internal */
+  public getTelemetryEmitter(): TelemetryEventEmitter | undefined {


[F3 — High] getTelemetryEmitter() exposes un-redacted error stacks to in-process listeners

@internal is a JSDoc tag, not a runtime guard — this method is reachable from any consumer that imports @databricks/sql. The TelemetryEventEmitter extends Node's EventEmitter, so:

client.getTelemetryEmitter()?.on('telemetry.error', e => exfil(e.errorStack));

works at runtime today. redactSensitive runs only at export time in DatabricksTelemetryExporter.toTelemetryLog, not at emit time, so an in-process listener sees raw errorMessage and errorStack (see DBSQLOperation.ts:592-602). With telemetry now default-on, this is a credential/PII channel that didn't materially exist before.

Fix: redact at the emitter (emitError) before calling this.emit(...). The aggregator/exporter then receive already-redacted strings — single redaction site. Optionally narrow getTelemetryEmitter() to a curated read-only surface.

Posted by Code Review Squad.

Done in 5669c30 — redactSensitive now runs at emit time inside TelemetryEventEmitter.emitError, so any in-process listener wired up via client.getTelemetryEmitter()?.on(...) sees the same redacted strings the export pipeline does. Both errorMessage (F21) and errorStack (F3) are redacted. The exporter keeps a second-pass redaction as defence-in-depth in case a metric reaches it via a path that bypassed the emitter.

vikrantpuppala · 2026-05-04T09:51:19Z

+      }
+    }
+    return undefined;
+  }


[F4 — High] Multi-tenant FIFO credential bleed across DBSQLClients on same host

A SaaS layer fronting tenants A and B against the same workspace host shares one TelemetryClient. Telemetry from tenant B's queries can be POSTed under tenant A's auth headers (whoever connected first wins the FIFO head). Same shape applies to userAgentEntry and telemetryAuthenticatedExport — tenant A's value silently wins. If tenant B explicitly set telemetryAuthenticatedExport: false, B's events ride A's authenticated pipeline.

Fix options (pick one):

Shard the registry by (host, userAgentEntry, authenticatedExport, authProvider-identity) instead of host alone.

Auto-disable telemetry when a second context registers with diverging auth/authenticatedExport (lose telemetry > leak credential).

Per-context exporter with shared circuit-breaker counters.

At minimum, document this prominently in the README opt-out section so SaaS deployments know to set telemetryEnabled: false.

Posted by Code Review Squad.

Documented in 5669c30 — added a prominent "Multi-tenant SaaS deployments" section to the README opt-out area explaining the FIFO sharing model and recommending telemetryEnabled: false or per-host partitioning. Also added a warn-level log in TelemetryClient.registerContext when a second registrant brings a different auth provider, so the leak shape is at least visible in logs. A proper registry shard by (host, auth-identity) is left as a follow-up — the documentation + warn is the safe-default fix.

vikrantpuppala · 2026-05-04T09:51:20Z

+    this.contexts = this.contexts.filter((c) => c !== context);
+    const auth = context.getAuthProvider?.();
+    if (auth) {
+      this.authProviders = this.authProviders.filter((a) => a !== auth);


[F5 — High] Auth-provider deregistration uses identity at unregister-time → stale credential leak on rotation

unregisterContext calls context.getAuthProvider?.() at unregister time and filters by referential equality. If the context's auth provider was rotated between register and unregister (token refresh that reconstructs the provider, lazy wrappers, anything that doesn't return the same object reference), the original provider is never removed from this.authProviders. The exporter's FIFO walk in getAuthProvider() (line 203-215) can keep authenticating with revoked credentials for the lifetime of the per-host singleton.

Fix: cache the auth-provider snapshot on the contexts entry at register time (registerContext); use that cached snapshot at unregister time, not a fresh call.

Posted by Code Review Squad.

Done in 5669c30 — contexts is now Array<{context, authProvider}> with the auth provider snapshotted at registerContext time. unregisterContext filters by the cached context reference and the cached authProvider rides along, so a rotated/reconstructed auth provider can no longer leak past unregister. Also added tryFallthrough which prunes throwing contexts from the FIFO (addresses F17).

vikrantpuppala · 2026-05-04T09:51:22Z

+      telemetryMaxRetries: 3,
+      telemetryAuthenticatedExport: true,
+      telemetryCircuitBreakerThreshold: 5,
+      telemetryCircuitBreakerTimeout: 60000, // 1 minute


[F9 — High] Default-config drift: 7 keys here vs 15 in DEFAULT_TELEMETRY_CONFIG

getDefaultConfig (this method, lines 128-134) defines 7 telemetry keys. DEFAULT_TELEMETRY_CONFIG in lib/telemetry/types.ts:91-103 defines 15. Components like MetricsAggregator fall back through ?? DEFAULT_TELEMETRY_CONFIG.X for the missing keys. Today the values agree by hand; they will silently desync the next time someone changes one source but not the other.

Fix: spread DEFAULT_TELEMETRY_CONFIG here (with a typed mapper for the telemetry-prefixed ClientConfig keys), or stop maintaining the inline copy and route every component through the single frozen const.

Posted by Code Review Squad.

Done in 5669c30 — getDefaultConfig() now sources all 15 telemetry keys from DEFAULT_TELEMETRY_CONFIG (with the unprefixed → telemetry-prefixed key mapping done once). Single source of truth; adding a new knob now requires one change to the frozen const and the public option surface.

vikrantpuppala · 2026-05-04T09:51:23Z

+    for (const k of telemetryOverrides) {
+      if (options[k] !== undefined) {
+        // The narrow union forces a cast; values are validated at point of use.
+        (this.config as any)[k] = options[k];


[F11 — Medium] as any cast removes type safety on every telemetry option override

The telemetryOverrides array above is as const so the keys are a literal union. Both ConnectionOptions and ClientConfig have identical key names and types for the telemetry knobs. The cast escapes the structural type system for no reason. A user passing telemetryBatchSize: "100" (string) silently writes a string into a number field; MetricsAggregator then reads it as number and aggregation thresholds break at runtime.

Fix: per-key narrowed assignments, or a small typed pickDefined<T, K>(dst: T, src: Partial<T>, keys: readonly K[]) helper. Same readability, no cast.

Posted by Code Review Squad.

Done in 5669c30 — replaced the as any cast with a typed copyDefinedTelemetryOptions(src: ConnectionOptions, dst: ClientConfig) helper that performs per-key narrowed assignments. A wrong-shape value (telemetryBatchSize: "100") now fails the TS check at the user call site instead of silently writing a string into a number field.

vikrantpuppala · 2026-05-04T09:51:25Z

+  /** @internal */
+  public getTelemetryAggregator() {
+    return this.telemetryClient?.getAggregator();
+  }


[F18 — Medium] getTelemetryAggregator() lacks explicit return-type annotation

Three reviewers flagged this. The return type is inferred as MetricsAggregator | undefined (from this.telemetryClient?.getAggregator()), but MetricsAggregator is intentionally not re-exported from lib/index.ts. Consumers calling client.getTelemetryAggregator() see a method whose return type references a non-exported symbol — autocomplete degrades, agent-completion gets confused, and the contract diverges from IClientContext.getTelemetryAggregator?(): MetricsAggregator | undefined even though the line two above (getTelemetryEmitter()) does annotate explicitly.

Fix: add : MetricsAggregator | undefined to match IClientContext. Since the method is @internal, an even cleaner option is to remove it from the public class entirely (move to a friend interface used only within the package) — TS-public methods marked only with a JSDoc @internal are the worst of both worlds for autocomplete.

Posted by Code Review Squad.

Done in 5669c30 — explicit getTelemetryAggregator(): MetricsAggregator | undefined annotation matching IClientContext. Also added a public companion getTelemetryStats(): {pendingMetricsCount, inFlightStatements, droppedMetrics, evictedStatements, circuitBreakerState, ...} | undefined (F13) so operators can read telemetry health without grepping debug logs.

vikrantpuppala · 2026-05-04T09:51:32Z

+    // charge — avoiding the footgun where a sysadmin sets the var to "false"
+    // expecting to enable telemetry.
+    const envKill = process.env.DATABRICKS_TELEMETRY_DISABLED;
+    const envDisabled = typeof envKill === 'string' && /^(1|true|yes|on)$/i.test(envKill.trim());


[F22 — Medium] DATABRICKS_TELEMETRY_DISABLED=false keeps telemetry on — silently ignored

The regex /^(1|true|yes|on)$/i is the deliberate "footgun-avoidance" choice (so false/0/no don't accidentally disable opt-out), and the comment above explains it. The remaining UX problem: an ops engineer who sees the env var and tries to "set it to false to keep telemetry on" gets the opposite of what they expect — false is silently ignored, leaving runtime config (default true) in charge.

Fix: log a LogLevel.warn line when the env var is set to a non-recognized non-empty value (false, 0, no, off, etc.) so ops can see their disable didn't take effect. That preserves the safe-default behavior while surfacing the misconfiguration.

Posted by Code Review Squad.

Done in 5669c30 — when DATABRICKS_TELEMETRY_DISABLED is set to a non-empty value that isn't in the recognized truthy set (1/true/yes/on, case-insensitive), the driver logs a LogLevel.warn line stating the value was ignored and listing the recognized forms. The safe-default behavior is preserved (no accidental opt-in), but ops can now see the disable-failed shape in logs. Unit tests cover all the unrecognized values (0/false/no/off/False/OFF) and the recognized ones.

vikrantpuppala · 2026-05-04T09:51:34Z

+      if (arrowBatches) {
+        for (const batch of arrowBatches) {
+          bytes += batch.batch?.length ?? 0;
+        }


[F10 — Medium] RowSetProvider.emitChunkEvent undercounts bytes for non-arrow paths

bytes is computed only from response.results?.arrowBatches. A TFetchResultsResp carrying inline JSON / column-based data (COLUMN_BASED_SET) reports bytes: 0. URL-based result sets report bytes via CloudFetchResultHandler (a separate emit site), so URL-based is fine. But column/json paths flow through here and the metric is then aggregated into bytesDownloaded in MetricsAggregator.processStatementEvent (details.bytesDownloaded += event.bytes ?? 0).

Net effect: dashboards that aggregate bytesDownloaded will be silently undercounted whenever the result is inline JSON / column-based.

Fix: compute and account inline-result bytes here for non-arrow paths, or document explicitly that bytesDownloaded from this emit site counts only arrow-batch payloads (so dashboards know to add the CloudFetch-emitted byte counts separately).

Posted by Code Review Squad.

Done in 5669c30 — emitChunkEvent now sums bytes across every TFetchResultsResp shape: arrowBatches[i].batch (existing), binaryColumns, and per-column values via a new estimateColumnBytes helper that handles all 8 TColumn value types plus their nulls buffers. URL-based result sets continue to emit from CloudFetchResultHandler.emitCloudFetchChunk with the post-download byte count. New tests cover both COLUMN_BASED_SET and binary-column paths.

vikrantpuppala · 2026-05-04T09:51:36Z

+    safeEmit(this.context, (emitter) => {
+      emitter.emitError({
+        statementId: this.id,
+        sessionId: this.sessionId,


[F12 — Medium] sessionId default mismatch — '' here, undefined in emitErrorEvent

emitStatementStart (line 549) and emitStatementComplete (line 575) emit sessionId: this.sessionId || ''. emitErrorEvent (line 596) emits sessionId: this.sessionId (no || ''). The aggregator's per-statement state groups by sessionId, so an operation with no sessionId emits start/complete keyed under empty string and errors keyed under undefined — split into two synthetic sessions in the aggregator and exporter.

Fix: pick one default and apply consistently. Prefer undefined and let the aggregator filter at the processEvent boundary (cleaner than synthesizing fake empty-string sessions).

Posted by Code Review Squad.

Done in 5669c30 — emitStatementStart/emitStatementComplete now pass this.sessionId directly (no || '' coercion), aligned with emitErrorEvent. Emitter signatures and StatementTelemetryDetails updated to accept sessionId?: string, so an operation that briefly lacks a sessionId no longer splits into two synthetic buckets in the aggregator.

vikrantpuppala · 2026-05-04T09:53:21Z

Code Review Squad Report

Scope: 29 files, +2970/-424 lines
Iteration: 1
Coverage: 9/9 reviewers delivered (security, devils-advocate, language, test, architecture, maintainability, agent-compat, ops, performance)
Merge Safety Score: 0/100 — CRITICAL RISK (deduction-floored by volume of findings; substance below)

Executive Summary

This is a sizable, mostly well-engineered telemetry layer that ships two verified data-integrity bugs that mis-stamp every connection-close metric and every workspace-attribution field on AWS+Azure customers, plus multi-tenant safety issues (FIFO auth-provider sharing, silent override of opt-out config) that materially undermine the privacy posture of a default-on feature. Tests for the +314-line DBSQLClient changes are essentially absent.

None of these are unrecoverable — most have small, focused fixes — but they should land before this merges and ships to customers as part of the 1.13 default-on telemetry rollout.

Critical & High findings (inline comments above)

ID	Severity	Title
F1	Critical	`connection.close.latencyMs` reports session lifetime, not RPC duration (`DBSQLSession.ts:596`)
F2	Critical	`extractWorkspaceId` is wrong on AWS+Azure hostnames (`DBSQLClient.ts:255`)
F3	High	`getTelemetryEmitter()` exposes un-redacted error stacks to in-process listeners (`DBSQLClient.ts:678`)
F4	High	Multi-tenant FIFO credential bleed across DBSQLClients on same host (`TelemetryClient.ts:215`)
F5	High	Auth-provider deregistration uses identity at unregister-time → stale credential leak on rotation (`TelemetryClient.ts:145`)
F9	High	Default-config drift: 7 keys in `getDefaultConfig` vs 15 in `DEFAULT_TELEMETRY_CONFIG` (`DBSQLClient.ts:134`)

High findings without a specific anchor

F6 — Default-on telemetry ships customer-controlled userAgentEntry un-redacted to the telemetry endpoint.
DatabricksTelemetryExporter.ts:119 concatenates userAgentEntry into the User-Agent header on every telemetry POST. The existing UA redactor only catches Databricks token prefixes — emails, JWTs, OAuth client_secret values, internal system codes, anything else customers stuff into userAgentEntry ships in the User-Agent header. With telemetry default-on, this shifts from opt-in risk to default risk. Fix: run redactSensitive over userAgentEntry before it lands in the UA, or document explicitly that anything passed to userAgentEntry is shipped to the telemetry endpoint.

F7 — DBSQLClient.ts +314 lines has zero new unit tests.
tests/unit/DBSQLClient.test.ts is unchanged in this PR. Untested code paths: initializeTelemetry (the refcount-release fix that motivated commit 7d10d1e), getTelemetryEmitter, getTelemetryAggregator, extractWorkspaceId, buildDriverConfiguration, mapAuthType, getLocaleName, getProcessName, the DATABRICKS_TELEMETRY_DISABLED env-var parser, the reconnect refcount-release path. Same gap for DBSQLSession.close CONNECTION_CLOSE emission and CloudFetchResultHandler.emitCloudFetchChunk. Fix: add unit tests with stubbed TelemetryClient/TelemetryClientProvider for at least the env-kill parser, reconnect refcount path, CONNECTION_OPEN/CLOSE emission, and the CloudFetch chunk emit site.

F8 — Telemetry HTTP retry double-stacks with user's 15-min retriesTimeout.
The exporter's sendRequest goes through connectionProvider.getRetryPolicy() → HttpRetryPolicy.invokeWithRetry, which is bounded by the customer's ClientConfig.retriesTimeout (15 min default) and retryMaxAttempts (5). On top of that the exporter has its own retry loop (telemetryMaxRetries=3, telemetryBackoffMaxMs=1s). Worst-case single-batch retry budget reaches ~15 min wall-clock per outer attempt × 3 outer ≈ ~15 min before the circuit breaker even sees the failure. During that window flushInFlight blocks subsequent flushes; addPendingMetric silently drops past the 500 cap. Fix: in DatabricksTelemetryExporter.sendRequest, override the retry policy to telemetry-specific bounds (or pass a NullRetryPolicy).

Medium / Low findings (no inline comment, summarized here)

F13 — Drop counter / queue depth / breaker state are invisible to operators. addPendingMetric drops are at LogLevel.debug only; default driver log level is info. Add droppedMetrics counter, warn-level summary on flush when > 0, and a getTelemetryStats() API.
F14 — Untyped EventEmitter; listener payloads inferred as any in TelemetryEventEmitter and the listener wired up in DBSQLClient.initializeTelemetry. Either typed on<K> overloads, or drop the EventEmitter inheritance entirely (the bridge has one listener per type).
F15 — CONNECTION_OPEN emitted per openSession, not per connect() — A long-running DBSQLClient opening 1000 sessions emits 1000 CREATE_SESSION events each with a ~1KB driverConfig blob. Strip static driverConfig from events 2..N or emit once per connect().
F16 — safeEmit typed as any despite no real cycle. Comment cites "dependency cycle through IClientContext / TelemetryEventEmitter," but IClientContext already does import type TelemetryEventEmitter. A typo like emitter.emiitConnectionOpen(...) slips through and fails at runtime as a swallowed exception. Type-only imports both ways and safeEmit becomes properly typed.
F17 — TelemetryClient fall-through helpers (getDriver/getClient/getConnectionProvider) don't prune throwing contexts. A throwing context (revoked auth, closed client) stays in this.contexts and is retried on every export indefinitely.
F19 — Stack-trace redaction misses common non-home paths: /var/lib/<service>/..., /opt/..., container /app/... (Lambda/k8s/Cloud Run common), \\server\share\..., macOS /private/var/folders/.... The Bearer/JWT patterns are correctly implemented; the path patterns are too narrow.
F20 — e2e telemetry-integration tests are spy-only. Asserts only that internal sinon spies were called; no ingestion verification. "Reference Counting" uses at.least(2). "Cleanup on close" uses disjunction over 3 spies. One spy created and never asserted (telemetryProviderSpy). No before() skip-guard on missing creds — runner hard-exits in unconfigured CI via validateConfig's process.exit(1).
F21 — No redaction of operation errorMessage body content. DBSQLOperation.ts:597-600: errorMessage: error.message || 'Unknown error' flows through verbatim. README claims "non-PII error context" but errorMessage strings can contain query fragments, table names, hostnames, parameter values that the SECRET_PATTERNS regex doesn't catch. errorStack runs through redactSensitive at export time, but not the message body.
F23 — Workspace-id casing inconsistency. Registry lowercases host (TelemetryClientProvider.normalizeHostKey); metric reports original case from extractWorkspaceId(this.host). Workspace-A.cloud.databricks.com registers under workspace-a but ships workspaceId: 'Workspace-A'.
F24 — Documented vs code defaults reconciliation. Have one canonical source for documented defaults (the README either stays abstract or quotes from DEFAULT_TELEMETRY_CONFIG).
F26 — chunkSumLatencyMs reported only when > 0 while chunkCount always reported (MetricsAggregator). 5 chunks / 0ms in dashboards is confusing.
F27 — compressionEnabled only carries last chunk's value — model mismatch with mixed-compression batches.
F28 — Redundant Promise.resolve(this.flush(false)).catch(...) wrapping in MetricsAggregator (3 sites). flush() already returns Promise<void>; drop the Promise.resolve wrapper.
F29 — No process.on('beforeExit', flush) hook to mitigate process.exit(0) data loss. Optional defensive add (gated by a flag).
F30 — TelemetryClientProvider.resetInstance() is reachable in production code. Rename to __resetInstanceForTests to discourage use.
F31 — IClientContext optional-method approach is a smell across many call sites; an ITelemetrySink interface would be cleaner long-term.
F32 — telemetryMaxPendingMetrics is read by aggregator from ClientConfig but not exposed in ConnectionOptions — users have no public knob to raise it.
F36 — Six near-identical emit* methods in TelemetryEventEmitter could share a private emitWrapped(type, build) helper. ~170 of 220 lines are scaffold.

Verified non-issues (confirmed against PR head; flagged so reviewers don't re-raise)

No "6 placeholder event handlers" in DBSQLClient.ts. They don't exist; the listener registration is an enum-driven bridge into the shared aggregator. (Confirmed by 4 reviewers: devils-advocate, maintainability, agent-compat, ops.)
No getTelemetryAggregator(): any. TS infers MetricsAggregator | undefined; the issue is missing explicit annotation (F18), not an any return.
No createHash('sha256') / connectionId hashing anywhere in this PR — the brief was wrong on this point.
HTTPS-only telemetry URL with comprehensive SSRF deny-list for IMDS/loopback/RFC1918/link-local is solid.
func-names: off in tests is legitimate for the Mocha function () { this.timeout(...) } idiom.
Failure isolation: telemetry never blocks the data path; safeEmit swallows; withErrorTelemetry re-throws after emit. Strong (multiple reviewers verified).
Bearer <token> redaction works on space-separated form (devils-advocate's H8 was wrong on this — the regex is correct).
TelemetryEventEmitter and TelemetryClientProvider test rewrites are coverage supersets, not abandonment.
Per-host shared singleton aggregator + 1 timer/host is well-engineered for scalability; performance review found no regressions on the data path.
Memory bounds (maxPendingMetrics, maxErrorsPerStatement, maxStatementMetrics, statementTtlMs) are correctly enforced at insert/eviction time.
Circuit breaker is per-host, with synchronous tryAdmit for HALF_OPEN probes (no probe race), counter resets on success.

Feedback? Drop it in #code-review-squad-feedback.

Critical/High: - F1 close-latency now measures closeSession RPC duration, not session lifetime - F2 extractWorkspaceId reads ?o=N from httpPath; returns undefined on miss (host-based extraction shipped wrong values on AWS/Azure) - F3 redact errorStack/errorMessage at emit time so in-process EventEmitter listeners see the same redacted strings the export pipeline does - F4 README "Multi-tenant SaaS deployments" section + warn on diverging auth providers across registrants on the same host - F5 cache auth-provider snapshot at register time; unregister uses cached reference so rotated providers don't leak in the FIFO - F6 redactSensitive(userAgentEntry) before UA construction in exporter - F8 telemetry retry no longer stacks on customer retriesTimeout - F9 getDefaultConfig sources all 15 telemetry keys from DEFAULT_TELEMETRY_CONFIG Mediums: - F10 RowSetProvider.emitChunkEvent counts bytes for column-based, binary-column, and arrow shapes (was arrow-only) - F11 copyDefinedTelemetryOptions helper replaces `as any` cast on overrides - F12 sessionId default consistently undefined across emit sites - F17 prune throwing contexts in TelemetryClient fall-through helpers - F18 explicit MetricsAggregator|undefined return annotation - F19 redaction patterns broadened: /var/lib, /opt, /srv, /app, UNC, macOS /var/folders - F20 e2e telemetry test gets skip-guard for missing creds + tightened assertions (counts, no disjunction-over-spies) - F21 errorMessage redacted at emit time (not only errorStack) - F22 warn on unrecognized DATABRICKS_TELEMETRY_DISABLED values Low / cleanup: - F13 getTelemetryStats() API + warn-level capacity-event summary - F14 typed on/off/once overloads on TelemetryEventEmitter - F15 ship driverConfig only on first openSession per client - F16 typed safeEmit (no more `any` for emitter/context) - F23 superseded by F2 (workspaceId is numeric now) - F24 README defaults section stays abstract — keys reference const - F26 chunkSumLatencyMs aligned with chunkCount presence - F27 compressionEnabled is sticky-OR across chunks, not last-wins - F28 drop redundant Promise.resolve(this.flush(...)) wraps - F29 optional telemetryFlushOnExit beforeExit hook, detached in close - F30 resetInstance renamed __resetInstanceForTests - F31 noted ITelemetrySink follow-up on IClientContext - F32 telemetryMaxPendingMetrics exposed on ConnectionOptions - F36 emitWrapped helper consolidates the per-method scaffold Tests: 26 new DBSQLClient unit tests (env-kill, extractWorkspaceId, refcount on reconnect/init-failure, F15 dedup, getTelemetryStats). 2 new RowSetProvider tests for non-arrow byte counting. e2e test rewritten with skip-guard + tightened assertions. Existing tests updated for new log-message format and snapshot-pair contexts. Co-authored-by: Isaac

github-actions · 2026-05-11T13:15:52Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

samikshya-db · 2026-05-11T13:21:04Z

Thanks for the thorough review — 5669c30 addresses all 32 findings (2 Critical, 7 High, 16 Medium/Low/Cleanup). Inline replies on each anchored thread; summary below for the unanchored ones.

Critical & High — addressed

F1 connection.close.latencyMs — capture closeStart immediately before closeSession RPC, emit Date.now() - closeStart.
F2 extractWorkspaceId — read numeric ?o=N from httpPath; return undefined rather than guess.
F3 / F21 Redaction moved to TelemetryEventEmitter.emitError so in-process listeners see the same redacted strings the exporter does; redaction now covers errorMessage body, not just errorStack.
F4 README "Multi-tenant SaaS deployments" section + warn on diverging auth providers at registration.
F5 Auth provider snapshotted at registerContext and cached on the contexts entry; unregister uses the cached reference, so a rotated provider can't leak past unregister.
F6 userAgentEntry runs through redactSensitive(..., 256) before UA construction in the telemetry exporter.
F7 New tests/unit/DBSQLClient.test.ts block covers env-kill parser (recognized + unrecognized values), extractWorkspaceId, reconnect refcount release, init-failure refcount release, driverConfig dedup, and getTelemetryStats. New tests/unit/result/RowSetProvider.test.ts cases cover the non-arrow byte counting.
F8 DatabricksTelemetryExporter.sendRequest no longer wraps connectionProvider.getRetryPolicy(). Single retry source = exportWithRetry with telemetry-specific bounds; the worst-case ~45 min stacked retry budget is gone.
F9 / F24 getDefaultConfig spreads from DEFAULT_TELEMETRY_CONFIG for all 15 keys; no more 7-vs-15 silent desync.

Medium / Low — addressed

F10 emitChunkEvent sums bytes across all TFetchResultsResp shapes (arrow, binary columns, COLUMN_BASED_SET via estimateColumnBytes).
F11 Replaced as any with typed copyDefinedTelemetryOptions.
F12 sessionId defaults to undefined at all emit sites; aggregator + emitter signatures accept sessionId?: string.
F13 New getTelemetryStats() on DBSQLClient (and TelemetryClient.getTelemetryStats()); warn-level summary on each flush when drops/evictions occur.
F14 Typed on/off/once overloads on TelemetryEventEmitter against a TelemetryEventMap.
F15 driverConfig blob is shipped only on the first openSession per client; subsequent sessions pass undefined. Re-armed on connect().
F16 safeEmit now uses type-only imports of IClientContext / TelemetryEventEmitter — no any anywhere, typo'd emitter method names fail at compile.
F17 tryFallthrough helper prunes throwing contexts from the FIFO so revoked-auth registrants aren't retried on every export.
F18 Explicit getTelemetryAggregator(): MetricsAggregator | undefined.
F19 Redaction patterns broadened: /var/lib/<svc>/, /opt/<dir>/, /srv/<dir>/, /app/<dir>/, UNC \\server\share\, macOS /var/folders/.
F20 e2e test skip-guards on missing creds (no more process.exit(1) import-time hard-fail); count-based assertions replace at.least(2) and disjunction-over-spies; unused spy removed.
F22 Warn when DATABRICKS_TELEMETRY_DISABLED is set to an unrecognized non-empty value.
F23 Superseded by F2 (workspaceId is now numeric, no case issue).
F26 chunkSumLatencyMs reported alongside chunkCount whenever there are chunks (was omitted when zero, producing "5 chunks / 0ms" rows).
F27 compressionEnabled is sticky-OR across chunks ("any compressed → true") instead of last-wins.
F28 Dropped 3 redundant Promise.resolve(this.flush(...)) wraps.
F29 Optional telemetryFlushOnExit config knob installs a beforeExit flush hook; detached in close() so long-running hosts don't accumulate listeners.
F30 TelemetryClientProvider.resetInstance → __resetInstanceForTests.
F31 Noted on IClientContext as a follow-up — ITelemetrySink extraction left out of this PR to keep scope contained.
F32 telemetryMaxPendingMetrics added to ConnectionOptions.
F36 Six per-event try/catch blocks consolidated through emitWrapped(type, build).

Test results

npx tsc --noEmit: clean on every file I touched.
npx eslint lib/**/*.ts: clean.
429 unit tests pass across DBSQLClient, DBSQLOperation, telemetry, and result handlers.
Pre-existing failures (AuthorizationCode.test.ts Symbol.asyncDispose typing, OAuthManager.test.ts FAPI2Client typing, LZ4 native-binding load on this host) reproduce on main and are unrelated to this commit.

Note: this PR depends on PR #326 (Telemetry Client Management) — the multi-context auth-provider refactor is built on top of TelemetryClient from that PR.

github-actions · 2026-05-11T13:24:34Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

- Re-apply prettier formatting on lib/DBSQLClient.ts, lib/result/CloudFetchResultHandler.ts, lib/result/RowSetProvider.ts (eslint --fix in the prior commit re-introduced quote/destructuring formatting that prettier rewrote back). - Revert the loosening of the "share telemetry client across multiple connections" e2e assertions back to exact-count form now that the PECO test workspace's telemetry feature flag is on. Strict count catches refcount regressions an at.least() assertion would hide. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

github-actions · 2026-05-11T13:54:22Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…sert CI on the previous commit showed two e2e failures driven by state that leaks across e2e test files, not by real defects: 1. `should initialize telemetry when telemetryEnabled is true` — `featureFlagCacheSpy.called` was false because a prior e2e suite had already created a TelemetryClient for the test host. Subsequent `getOrCreateClient` calls hit the existing holder and skip the constructor (and the FFCache.getOrCreateContext call inside it). Fixed by `before` + `beforeEach` `__resetInstanceForTests()` so this suite always sees a fresh provider regardless of execution order. That also clears the "second context, different auth" F4 warn that was firing on every test in this suite. 2. `should cleanup telemetry on close` — `MetricsAggregator.flush()` isn't called by the close-drain when `pendingMetrics` is empty (the test connects + closes without ever `openSession`-ing, so no events are queued). Replaced the strict `flushSpy.called` assert with `aggregatorCloseSpy.called` — `MetricsAggregator.close()` is the cleanup surface that always runs, regardless of buffer state. Also: change the outer `describe` from `function ()` to arrow form (eslint `prefer-arrow-callback`); only `before`/`it` need the function-form for `this`. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

github-actions · 2026-05-11T14:12:59Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

This was referenced Jan 28, 2026

[5/7] Telemetry Export: DatabricksTelemetryExporter with Retry and Circuit Breaker #328

Merged

[6/7] Telemetry Integration: Wire into Driver Components #329

Merged

[7/7] Telemetry Testing and Documentation - FINAL LAYER #330

Merged

samikshya-db marked this pull request as draft January 28, 2026 22:41

samikshya-db marked this pull request as ready for review January 29, 2026 08:22

samikshya-db requested a review from jadewang-db January 29, 2026 09:01

samikshya-db and others added 11 commits January 29, 2026 20:20

Fix feature flag and telemetry export endpoints

c3779af

- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling

Match JDBC telemetry payload format

a105777

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

Fix lint errors

1cdb716

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

Fix feature flag and telemetry export endpoints

4437ae9

- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling

Match JDBC telemetry payload format

e9c3138

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

Fix lint errors

32003e9

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

samikshya-db force-pushed the telemetry-3-client-management branch from 87d1e85 to 32003e9 Compare January 29, 2026 20:21

samikshya-db and others added 11 commits January 29, 2026 20:21

Match JDBC telemetry payload format

1b49399

- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing

Fix lint errors

29376a6

- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils

samikshya-db temporarily deployed to azure-prod April 29, 2026 19:54 — with GitHub Actions Inactive

vikrantpuppala reviewed May 4, 2026

View reviewed changes

samikshya-db had a problem deploying to azure-prod May 11, 2026 13:15 — with GitHub Actions Failure

Merge branch 'main' into telemetry-4-event-aggregation

4750bc2

samikshya-db had a problem deploying to azure-prod May 11, 2026 13:23 — with GitHub Actions Failure

samikshya-db had a problem deploying to azure-prod May 11, 2026 13:54 — with GitHub Actions Failure

samikshya-db deployed to azure-prod May 11, 2026 14:12 — with GitHub Actions Active

samikshya-db requested a review from vikrantpuppala May 12, 2026 16:00

Conversation

samikshya-db commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Part 4 of 7-part Telemetry Implementation Stack

What's in this PR

TelemetryEventEmitter (lib/telemetry/TelemetryEventEmitter.ts)

TelemetryClient owns the per-host triad (lib/telemetry/TelemetryClient.ts)

MetricsAggregator (per-host, on TelemetryClient)

Error telemetry wired into operation entry points

Type-safe wiring (IClientContext)

Verified end-to-end against an Azure Databricks workspace

Testing

Dependencies

Out of scope (open follow-ups)

Uh oh!

samikshya-db commented Jan 29, 2026

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vikrantpuppala commented May 4, 2026

Code Review Squad Report

Executive Summary

Critical & High findings (inline comments above)

High findings without a specific anchor

Medium / Low findings (no inline comment, summarized here)

Verified non-issues (confirmed against PR head; flagged so reviewers don't re-raise)

Uh oh!

github-actions Bot commented May 11, 2026

Uh oh!

samikshya-db commented May 11, 2026

Critical & High — addressed

Medium / Low — addressed

Test results

Uh oh!

samikshya-db commented Jan 28, 2026 •

edited

Loading

TelemetryEventEmitter (`lib/telemetry/TelemetryEventEmitter.ts`)

TelemetryClient owns the per-host triad (`lib/telemetry/TelemetryClient.ts`)

MetricsAggregator (per-host, on `TelemetryClient`)

Type-safe wiring (`IClientContext`)